Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor

Abstract

Many websites on the Internet are now multilingual and can be considered sources of parallel corpora. This paper describes Bitextor, a free/open-source tool created to harvest aligned bitexts from such multilingual websites; these bitexts can be used to train corpus-based machine translation systems. The tool builds on previous approaches, with modifications and improvements intended to make it as adaptable as possible, so that it can process any kind of website and work with any pair of languages. We describe the content-based and URL-based heuristics and algorithms used to identify and align parallel web pages within a website and, finally, present some results that show the functionality of the application and set out future lines of work for this project.
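
To make the idea of a URL-based heuristic concrete, the following is a minimal sketch and not Bitextor's actual implementation, which the abstract does not detail. It assumes that parallel pages differ only by a language marker in the URL path (for example `/en/` versus `/es/`). The names `LANG_CODES`, `language_neutral_key`, and `candidate_pairs`, as well as the example URLs, are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately small set of language markers (an assumption).
LANG_CODES = {"en", "es", "ca", "fr", "de"}


def language_neutral_key(url):
    """Return (key, lang): the URL with its first language marker in the
    path replaced by '*', and the marker found (or None if there is none)."""
    parts = urlsplit(url)
    # Split the path on common delimiters, keeping them so it can be rebuilt.
    tokens = re.split(r"([/._\-])", parts.path)
    for i, tok in enumerate(tokens):
        if tok.lower() in LANG_CODES:
            tokens[i] = "*"
            return parts.netloc + "".join(tokens), tok.lower()
    return parts.netloc + parts.path, None


def candidate_pairs(urls):
    """Group URLs by language-neutral key and pair pages in different languages."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = language_neutral_key(url)
        if lang is not None:
            groups[key][lang] = url
    pairs = []
    for by_lang in groups.values():
        langs = sorted(by_lang)
        for i, a in enumerate(langs):
            for b in langs[i + 1:]:
                pairs.append((by_lang[a], by_lang[b]))
    return pairs


if __name__ == "__main__":
    urls = [
        "http://example.com/en/about.html",
        "http://example.com/es/about.html",
        "http://example.com/en/news.html",
        "http://example.com/contact.html",
    ]
    for pair in candidate_pairs(urls):
        print(pair)
```

A heuristic of this kind only proposes candidates. In a combined approach such as the one the abstract describes, content-based checks would then be needed to confirm or reject each candidate pair.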
